add options for xccl work #16

Closed
wants to merge 1,661 commits into from

Conversation


@Chao1Han Chao1Han commented Jun 4, 2025

Fixes #ISSUE_NUMBER

ydwu4 and others added 30 commits June 28, 2025 18:14
…6398)

Fixes test_cuda.py::TestCuda::test_hip_device_count in the single-GPU scenario

Pull Request resolved: pytorch#156398
Approved by: https://github.com/jeffdaily
Similar to cublas/hipblas, LT now allocates one workspace per handle+stream combo.

- fixes hipblaslt issue where memory use increased during graph capture
- preserves the CUDA env var TORCH_CUBLASLT_UNIFIED_WORKSPACE (usage sketch below)
- moves the LT workspace and size from CUDABlas.cpp into CublasHandlePool.cpp, adding new APIs:
  - size_t getCUDABlasLtWorkspaceSize()
  - void* getCUDABlasLtWorkspace()
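A small usage sketch of the preserved env var; that "1" is the enabling value, and that it must be set before torch initializes CUDA, are assumptions here:

```py
import os

# Assumption: "1" enables the unified cuBLASLt workspace.
os.environ["TORCH_CUBLASLT_UNIFIED_WORKSPACE"] = "1"

import torch  # noqa: E402  # import after setting the env var
```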

Fixes ROCm#2286.

Pull Request resolved: pytorch#156495
Approved by: https://github.com/eqy
Closes pytorch#156202

This PR adds Blackwell support for GroupMM.

Most of the code used for SM90 can be reused; the kernel schedule has to be changed in accordance with https://docs.nvidia.com/cutlass/media/docs/cpp/blackwell_functionality.html

Some preliminary benchmarking of H200 vs. B200:

Script:
```py
import torch
print(torch.__file__)
device = torch.device("cuda")
dtype = torch.bfloat16

shapes = [
    (16, 128000, 7168, 7168),
    (128, 1, 2048, 7168)
]

for batch, M, N, K in shapes:
    a = torch.randn(batch, M, K, device=device, dtype=dtype)
    b = torch.randn(batch, N, K, device=device, dtype=dtype)

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)

    # warm up before timing
    for i in range(5):
        c = torch._grouped_mm(a, b)

    num_iter = 50
    start_event.record()

    for i in range(num_iter):
        c = torch._grouped_mm(a, b)
    end_event.record()

    # wait for the timed kernels to finish before reading the events
    torch.cuda.synchronize()
    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iter
    print(f"batch: {batch}\tM: {M}\tN: {N}\tK: {K}")
    print(f"Time per Iteration:\t {avg_time_ms:.4f} ms")
```

On H200
```
batch: 16	M: 128000	N: 7168	K: 7168
Time per Iteration:	 298.6668 ms
batch: 128	M: 1	N: 2048	K: 7168
Time per Iteration:	 4.1462 ms
```

On B200
```
batch: 16       M: 128000       N: 7168 K: 7168
Time per Iteration:      190.7458 ms
batch: 128      M: 1    N: 2048 K: 7168
Time per Iteration:      3.0680 ms
```
`nsys nvprof` output:
```
root@16930b42ffc6:/workspace/pytorch# nsys nvprof python gemm_test.py
WARNING: python and any of its children processes will be profiled.

Collecting data...
batch: 16	M: 128000	N: 7168	K: 7168
Time per Iteration:	 192.6420 ms
batch: 128	M: 1	N: 2048	K: 7168
Time per Iteration:	 1.2255 ms
Generating '/tmp/nsys-report-6a53.qdstrm'
[1/7] [========================100%] report1.nsys-rep
[2/7] [========================100%] report1.sqlite
[3/7] Executing 'nvtx_sum' stats report
SKIPPED: /workspace/pytorch/report1.sqlite does not contain NV Tools Extension (NVTX) data.
[4/7] Executing 'cuda_api_sum' stats report

 Time (%)  Total Time (ns)  Num Calls    Avg (ns)      Med (ns)    Min (ns)   Max (ns)    StdDev (ns)                 Name
 --------  ---------------  ---------  ------------  ------------  --------  -----------  ------------  ---------------------------------
     98.9      10586895744          2  5293447872.0  5293447872.0  73786464  10513109280  7381715954.2  cudaDeviceSynchronize
      1.0        104084608          5    20816921.6    33552480.0    100800     34786208    18048125.3  cudaMalloc
      0.1          5694304          4     1423576.0     1416656.0   1258560      1602432      181668.1  cudaGetDeviceProperties_v2_v12000
      0.1          5430496        130       41773.0        4560.0      2496      3854368      345761.8  cudaLaunchKernel
      0.0           587584        110        5341.7        4992.0      4224        16992        1482.0  cudaLaunchKernelExC_v11060
      0.0           119200        660         180.6         128.0        96         4128         206.7  cudaGetDriverEntryPoint_v11030
      0.0            68352        660         103.6          64.0        32         4928         224.6  cuTensorMapEncodeTiled
      0.0            34976         49         713.8         224.0       160         6720        1343.4  cudaStreamIsCapturing_v10000
      0.0            32992          4        8248.0        7456.0      4128        13952        4804.4  cudaEventRecord
      0.0            16928          4        4232.0        3600.0      1728         8000        2764.7  cudaEventQuery
      0.0            16288          4        4072.0        3568.0      1952         7200        2396.1  cudaEventCreateWithFlags
      0.0            13632          4        3408.0        2672.0       544         7744        3408.7  cudaEventDestroy
      0.0             1056          1        1056.0        1056.0      1056         1056           0.0  cuModuleGetLoadingMode

[5/7] Executing 'cuda_gpu_kern_sum' stats report

 Time (%)  Total Time (ns)  Instances   Avg (ns)     Med (ns)    Min (ns)   Max (ns)   StdDev (ns)                                                  Name
 --------  ---------------  ---------  -----------  -----------  ---------  ---------  -----------  ----------------------------------------------------------------------------------------------------
     99.0      10549232845         55  191804233.5  192944479.0  165746368  203645313    5353204.3  void cutlass::device_kernel<at::cuda::detail::enable_3x_kernel_for_sm10<cutlass::gemm::kernel::Gemm…
      0.6         67327135         55    1224129.7    1330656.0     924320    1364928     182180.4  void cutlass::device_kernel<at::cuda::detail::enable_3x_kernel_for_sm10<cutlass::gemm::kernel::Gemm…
      0.3         34854783         20    1742739.1    1597856.0      10080    3899616     818421.2  void at::native::<unnamed>::distribution_elementwise_grid_stride_kernel<float, (int)4, void at::nat…
      0.0           354880        110       3226.2       3296.0       1920       4160        554.4  void at::cuda::detail::prepare_grouped_gemm_data<cutlass::bfloat16_t, cutlass::bfloat16_t, cutlass:…
```

The kernel names are too long to be shown by nvprof, so I pasted these from Nsight Systems:
```
small kernel 1SM
100.0%	1.286 ms	1	1.286 ms	1.286 ms	1.286 ms	1.286 ms	0 ns	void cutlass::device_kernel<at::cuda::detail::enable_3x_kernel_for_sm10<cutlass::gemm::kernel::GemmUniversal<cutlass::gemm::GroupProblemShape<cute::tuple<int, int, int>>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm100ArrayTmaUmmaWarpSpecialized<(int)3, (int)8, (int)2, cute::tuple<cute::C<(int)2>, cute::C<(int)1>, cute::C<(int)1>>>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::bfloat16_t, cute::tuple<cute::C<(int)1>, long, cute::C<(int)0>> *, cute::TiledMMA<cute::MMA_Atom<cute::SM100_MMA_F16BF16_SS<cutlass::bfloat16_t, cutlass::bfloat16_t, float, (int)128, (int)256, (cute::UMMA::Major)0, (cute::UMMA::Major)1, (cute::UMMA::ScaleIn)0, (cute::UMMA::ScaleIn)0>>, cute::Layout<cute::tuple<cute::C<(int)1>, cute::C<(int)1>, cute::C<(int)1>>, cute::tuple<cute::C<(int)0>, cute::C<(int)0>, cute::C<(int)0>>>, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore>>, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, void, cute::identity, cute::SM90_TMA_LOAD_MULTICAST, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)64>, cute::C<(int)8>>, cute::tuple<cute::C<(int)1>, cute::C<(int)64>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm100PtrArrayTmaWarpSpecialized<(int)4, (int)2, (int)64, (bool)1, (bool)0>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cute::tuple<cute::Layout<cute::C<(int)128>, cute::C<(int)1>>, cute::Layout<cute::C<(int)64>, cute::C<(int)1>>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::epilogue::fusion::FusionCallbacks<cutlass::epilogue::Sm100PtrArrayTmaWarpSpecialized<(int)4, (int)2, (int)64, (bool)1, (bool)0>, cutlass::epilogue::fusion::LinearCombination<cutlass::bfloat16_t, float, cutlass::bfloat16_t, float, (cutlass::FloatRoundStyle)2>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cute::tuple<cute::Layout<cute::C<(int)128>, cute::C<(int)1>>, cute::Layout<cute::C<(int)64>, cute::C<(int)1>>>, >, cute::SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b64x, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>>, void, void>>>(T1::Params)

large kernel 2SM
100.0%	194.178 ms	1	194.178 ms	194.178 ms	194.178 ms	194.178 ms	0 ns	void cutlass::device_kernel<at::cuda::detail::enable_3x_kernel_for_sm10<cutlass::gemm::kernel::GemmUniversal<cutlass::gemm::GroupProblemShape<cute::tuple<int, int, int>>, cutlass::gemm::collective::CollectiveMma<cutlass::gemm::MainloopSm100ArrayTmaUmmaWarpSpecialized<(int)5, (int)8, (int)2, cute::tuple<cute::C<(int)2>, cute::C<(int)1>, cute::C<(int)1>>>, cute::tuple<cute::C<(int)256>, cute::C<(int)256>, cute::C<(int)64>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::bfloat16_t, cute::tuple<cute::C<(int)1>, long, cute::C<(int)0>> *, cute::TiledMMA<cute::MMA_Atom<cute::SM100_MMA_F16BF16_2x1SM_SS<cutlass::bfloat16_t, cutlass::bfloat16_t, float, (int)256, (int)256, (cute::UMMA::Major)0, (cute::UMMA::Major)1, (cute::UMMA::ScaleIn)0, (cute::UMMA::ScaleIn)0>>, cute::Layout<cute::tuple<cute::C<(int)1>, cute::C<(int)1>, cute::C<(int)1>>, cute::tuple<cute::C<(int)0>, cute::C<(int)0>, cute::C<(int)0>>>, cute::tuple<cute::Underscore, cute::Underscore, cute::Underscore>>, cute::SM100_TMA_2SM_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, void, cute::identity, cute::SM100_TMA_2SM_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)64>, cute::C<(int)8>>, cute::tuple<cute::C<(int)1>, cute::C<(int)64>>>>, void, cute::identity>, cutlass::epilogue::collective::CollectiveEpilogue<cutlass::epilogue::Sm100PtrArrayTmaWarpSpecialized<(int)4, (int)2, (int)64, (bool)1, (bool)0>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cute::tuple<cute::Layout<cute::C<(int)128>, cute::C<(int)1>>, cute::Layout<cute::C<(int)64>, cute::C<(int)1>>>, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::bfloat16_t, cute::tuple<long, cute::C<(int)1>, cute::C<(int)0>> *, cutlass::epilogue::fusion::FusionCallbacks<cutlass::epilogue::Sm100PtrArrayTmaWarpSpecialized<(int)4, (int)2, (int)64, (bool)1, (bool)0>, cutlass::epilogue::fusion::LinearCombination<cutlass::bfloat16_t, float, cutlass::bfloat16_t, float, (cutlass::FloatRoundStyle)2>, cute::tuple<cute::C<(int)128>, cute::C<(int)256>, cute::C<(int)64>>, cute::tuple<cute::Layout<cute::C<(int)128>, cute::C<(int)1>>, cute::Layout<cute::C<(int)64>, cute::C<(int)1>>>, >, cute::SM100::TMEM::LOAD::SM100_TMEM_LOAD_32dp32b64x, cute::SM90_TMA_LOAD, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>, cute::SM90_TMA_STORE, cute::ComposedLayout<cute::Swizzle<(int)3, (int)4, (int)3>, cute::smem_ptr_flag_bits<(int)16>, cute::Layout<cute::tuple<cute::C<(int)8>, cute::C<(int)64>>, cute::tuple<cute::C<(int)64>, cute::C<(int)1>>>>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>, cute::AutoVectorizingCopyWithAssumedAlignment<(int)128>>, void, void>>>(T1::Params)
```

Pull Request resolved: pytorch#156203
Approved by: https://github.com/syed-ahmed, https://github.com/drisspg
…ke --version` (pytorch#157073)

`cmake -E capabilities` produces a JSON format that is more machine-friendly.

```console
$ cmake --version
cmake version 4.0.3

CMake suite maintained and supported by Kitware (kitware.com/cmake).
$ cmake -E capabilities | jq '.version.string'
"4.0.3"
$ cmake -E capabilities | jq
{
  "debugger": true,
  "fileApi": {
    "requests": [
      {
        "kind": "codemodel",
        "version": [
          {
            "major": 2,
            "minor": 8
          }
        ]
      },
      {
        "kind": "configureLog",
        "version": [
          {
            "major": 1,
            "minor": 0
          }
        ]
      },
      {
        "kind": "cache",
        "version": [
          {
            "major": 2,
            "minor": 0
          }
        ]
      },
      {
        "kind": "cmakeFiles",
        "version": [
          {
            "major": 1,
            "minor": 1
          }
        ]
      },
      {
        "kind": "toolchains",
        "version": [
          {
            "major": 1,
            "minor": 0
          }
        ]
      }
    ]
  },
  "generators": [
    {
      "extraGenerators": [],
      "name": "Watcom WMake",
      "platformSupport": false,
      "toolsetSupport": false
    },
    {
      "extraGenerators": [
        "Kate"
      ],
      "name": "Ninja Multi-Config",
      "platformSupport": false,
      "toolsetSupport": false
    },
    {
      "extraGenerators": [
        "CodeBlocks",
        "CodeLite",
        "Eclipse CDT4",
        "Kate",
        "Sublime Text 2"
      ],
      "name": "Ninja",
      "platformSupport": false,
      "toolsetSupport": false
    },
    {
      "extraGenerators": [],
      "name": "Xcode",
      "platformSupport": false,
      "toolsetSupport": true
    },
    {
      "extraGenerators": [
        "CodeBlocks",
        "CodeLite",
        "Eclipse CDT4",
        "Kate",
        "Sublime Text 2"
      ],
      "name": "Unix Makefiles",
      "platformSupport": false,
      "toolsetSupport": false
    }
  ],
  "serverMode": false,
  "tls": true,
  "version": {
    "isDirty": false,
    "major": 4,
    "minor": 0,
    "patch": 3,
    "string": "4.0.3",
    "suffix": ""
  }
}
```
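A minimal sketch of consuming this JSON from Python, rather than scraping the `cmake --version` text:

```py
import json
import subprocess

caps = json.loads(subprocess.check_output(["cmake", "-E", "capabilities"]))
print(caps["version"]["string"])  # e.g. "4.0.3"
```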

Pull Request resolved: pytorch#157073
Approved by: https://github.com/Skylion007
Creating contiguous strides creates the expression max(1, x). Often we know that x >= 1, in which case we should simplify max(1, x) to x.

This appeared in two situations:
1) An internal user complained about statically_known_true(x == max(1, x)) failing (internal link: https://fb.workplace.com/groups/1028545332188949/permalink/1232958568414290). With this change, pytorch#155938 won't be needed.

2) Not simplifying the above could result in wrong ConstraintViolationErrors, because we assume non-trivial single-arg guards shall evaporate; see the logic in the function `issue_guard` in `symbolic_shapes.py`.

With this change we no longer throw ConstraintViolationErrors for the program below. This was blocking [this PR](pytorch#155590) from landing internally, due to internal export tests throwing ConstraintViolationErrors like:
```
Constraints violated (width)!
  - Not all values of width = L['x'].size()[3] in the specified range 224 <= width <= 455 satisfy the generated guard max(1, 1 + (((-1) + L['x'].size()[3]) // 2)) == (1 + (((-1) + L['x'].size()[3]) // 2)).
```
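A quick check that the guard above is vacuous on the allowed range; the inner expression is at least 112 for 224 <= width <= 455, so the max() is redundant:

```py
for width in range(224, 456):
    e = 1 + ((-1 + width) // 2)
    assert max(1, e) == e  # max(1, e) simplifies to e since e >= 1
```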

```
x = torch.rand(10)
torch._dynamo.mark_dynamic(x, 0, max=20, min=5)

@torch.compile(fullgraph=True, dynamic=True)
def func(x):
    if max(1, (-1 + x.size()[0]//2)) == (-1+x.size()[0]//2):
        return x*400
    else:
        return (x*10)*100

func(x)

```

Pull Request resolved: pytorch#157189
Approved by: https://github.com/pianpwk
Fixes pytorch#155046

This change allows Cholesky inversion to use rocSOLVER. This is now also the default on ROCm for Cholesky inversion which aligns with the behavior on NVIDIA (which defaults to cuSOLVER for this linear algebra operation). This fix also gets around a memory access fault encountered in MAGMA for large matrices.

MAGMA can still be forced on ROCm by doing:
```
torch.backends.cuda.preferred_linalg_library(backend='magma')
```

Ran all Cholesky unit tests on ROCm; there were no regressions.
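For reference, a minimal sketch of the operation this dispatch affects, using only public APIs:

```py
import torch

# Build a symmetric positive-definite matrix, factor it, then invert via the
# Cholesky factor; on ROCm the inversion now routes to rocSOLVER by default.
A = torch.randn(512, 512, device="cuda")
A = A @ A.mT + 512 * torch.eye(512, device="cuda")
L = torch.linalg.cholesky(A)
A_inv = torch.cholesky_inverse(L)
```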

Pull Request resolved: pytorch#157154
Approved by: https://github.com/jeffdaily
Summary: This diff updates DCP driver code/APIs to support Zero Overhead Checkpointing

Test Plan: Test with TorchTitan on this PR: pytorch/torchtitan#1287

Differential Revision: D72391401

Pull Request resolved: pytorch#156207
Approved by: https://github.com/teja-rao
Prereqs:
 - pytorch#152708

Features:
1. Adds inductor's estimate of flops and bandwidth to the json trace events that perfetto uses.
1. Only uses the TFLOPS estimate from Triton if we don't have the info from the datasheet, because Triton's estimates are inaccurate. I have a backlog item to fix Triton flops estimation upstream. New `DeviceInfo` class and new function `get_device_tflops`.
1. New helpers `countable_fx` and `count_flops_fx` help get the flops of an `fx.Node`.
1. Extends Triton `torch.profiler` logging to `DebugAutotuner`.
1. New script `profile_analysis.py`: `--augment_trace` adds perf estimates to any perfetto json trace, `--analyze` creates a summary table of these perf estimates, and `--diff` will compare two traces side by side:
```
Device(NVIDIA H100, 0):
 Kernel Name                              | resnet Kernel Count | resnet FLOPS       | resnet bw gbps        | resnet Dur (ms)    | resnet Achieved FLOPS % | resnet Achieved Bandwidth % | newresnet Kernel Count | newresnet FLOPS    | newresnet bw gbps     | newresnet Dur (ms) | newresnet Achieved FLOPS % | newresnet Achieved Bandwidth %
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 triton_poi_fused__native_batch_norm_legi | 24                  | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                       | 0.003401572611382541        | 24                     | 0                  | 0.11395268248131513   | 2.5919166666666666 | 0                          | 0.003401572611382541
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 142                 | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583     | 0.007716441266265022        | 142                    | 16932673552.422373 | 0.2585007824198784    | 12.441619718309857 | 0.08683422334575583        | 0.007716441266265022
 triton_red_fused__native_batch_norm_legi | 39                  | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                       | 0.004176126863316074        | 39                     | 0                  | 0.13990024992108846   | 5.752589743589743  | 0                          | 0.004176126863316074
 triton_poi_fused__native_batch_norm_legi | 25                  | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                       | 0.009499718184339253        | 25                     | 0                  | 0.31824055917536503   | 2.5291999999999994 | 0                          | 0.009499718184339253
 void cutlass::Kernel2<cutlass_80_tensoro | 98                  | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874     | 0.012827592254037562        | 98                     | 16211056473.596165 | 0.42972434051025826   | 7.130408163265306  | 0.08313362294151874        | 0.012827592254037562
 triton_red_fused__native_batch_norm_legi | 73                  | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                       | 0.009628003963020014        | 73                     | 0                  | 0.3225381327611705    | 9.987068493150682  | 0                          | 0.009628003963020014
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                       | 0.043257347302946926        | 15                     | 0                  | 1.4491211346487216    | 4.439333333333333  | 0                          | 0.043257347302946926
 void cutlass::Kernel2<cutlass_80_tensoro | 186                 | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027     | 0.007961586274361157        | 186                    | 14501701145.337954 | 0.2667131401910989    | 7.873865591397849  | 0.07436769818122027        | 0.007961586274361157
 triton_poi_fused__native_batch_norm_legi | 33                  | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                       | 0.044550915039384846        | 33                     | 0                  | 1.4924556538193923    | 4.3101515151515155 | 0                          | 0.044550915039384846
 triton_red_fused__native_batch_norm_legi | 29                  | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                       | 0.007630624036606301        | 29                     | 0                  | 0.25562590522631107   | 6.296275862068965  | 0                          | 0.007630624036606301
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                       | 0.01752406619162008         | 13                     | 0                  | 0.5870562174192726    | 2.7397692307692307 | 0                          | 0.01752406619162008
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 0.41409928846284      | 2.853588235294117  | 0                       | 0.012361172789935523        | 34                     | 0                  | 0.41409928846284      | 2.853588235294117  | 0                          | 0.012361172789935523
 triton_per_fused__native_batch_norm_legi | 34                  | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                       | 0.0034941238826919864       | 34                     | 0                  | 0.11705315007018151   | 3.460647058823529  | 0                          | 0.0034941238826919864
 triton_poi_fused__native_batch_norm_legi | 16                  | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                       | 0.005136672596156592        | 16                     | 0                  | 0.17207853197124584   | 2.3459375000000002 | 0                          | 0.005136672596156592
 triton_per_fused__native_batch_norm_legi | 30                  | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                       | 0.007879744244842555        | 30                     | 0                  | 0.2639714322022256    | 6.131199999999999  | 0                          | 0.007879744244842555
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 100                 | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531     | 0.005819245035648175        | 100                    | 11875430356.891787 | 0.19494470869421385   | 16.36534           | 0.06089964285585531        | 0.005819245035648175
 triton_poi_fused__native_batch_norm_legi | 8                   | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                       | 0.029415213809625928        | 8                      | 0                  | 0.9854096626224687    | 3.2757500000000004 | 0                          | 0.029415213809625928
 void cublasLt::splitKreduce_kernel<32, 1 | 56                  | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628     | 0.024806865808245714        | 56                     | 34377923395.147064 | 0.8310300045762317    | 3.4199999999999986 | 0.17629704305203628        | 0.024806865808245714
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                       | 0.02968359094286896         | 23                     | 0                  | 0.9944002965861103    | 3.2431304347826084 | 0                          | 0.02968359094286896
 triton_per_fused__native_batch_norm_legi | 10                  | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                       | 0.00545313748934644         | 10                     | 0                  | 0.1826801058931057    | 4.428800000000001  | 0                          | 0.00545313748934644
 triton_poi_fused__native_batch_norm_legi | 10                  | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                       | 0.009459622642884923        | 10                     | 0                  | 0.3168973585366449    | 2.5471999999999997 | 0                          | 0.009459622642884923
 triton_poi_fused__native_batch_norm_legi | 34                  | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                       | 0.03421974596124114         | 34                     | 0                  | 1.1463614897015777    | 4.124323529411764  | 0                          | 0.03421974596124114
 void cask_plugin_cudnn::xmma_cudnn::init | 44                  | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194     | 0.06167532194133924         | 44                     | 44045510816.64277  | 2.0661232850348643    | 3.6887499999999993 | 0.22587441444432194        | 0.06167532194133924
 sm90_xmma_fprop_implicit_gemm_f32f32_tf3 | 95                  | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802     | 0.014014750913273854        | 95                     | 7876855400.165316  | 0.4694941555946739    | 18.224315789473682 | 0.04039413025725802        | 0.014014750913273854
 triton_per_fused__native_batch_norm_legi | 41                  | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                       | 0.002037513395819492        | 41                     | 0                  | 0.06825669875995298   | 3.0384146341463416 | 0                          | 0.002037513395819492
 triton_poi_fused__native_batch_norm_legi | 23                  | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                       | 0.0026292999141582997       | 23                     | 0                  | 0.08808154712430301   | 2.3275652173913044 | 0                          | 0.0026292999141582997
 triton_per_fused__native_batch_norm_legi | 40                  | 0                  | 0.18179321034952417   | 4.556825           | 0                       | 0.005426662995508183        | 40                     | 0                  | 0.18179321034952417   | 4.556825           | 0                          | 0.005426662995508183
 triton_poi_fused__native_batch_norm_legi | 15                  | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                       | 0.017574373598370836        | 15                     | 0                  | 0.5887415155454232    | 2.783866666666667  | 0                          | 0.017574373598370836
 void cutlass::Kernel2<cutlass_80_tensoro | 38                  | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546      | 0.007659474756834           | 38                     | 14242013806.264643 | 0.256592404353939     | 7.217631578947369  | 0.0730359682372546         | 0.007659474756834
 triton_poi_fused__native_batch_norm_legi | 21                  | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                       | 0.017441376040091088        | 21                     | 0                  | 0.5842860973430516    | 2.7779047619047623 | 0                          | 0.017441376040091088
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                       | 0.0034356313950705724       | 16                     | 0                  | 0.11509365173486417   | 3.5959375000000002 | 0                          | 0.0034356313950705724
 triton_poi_fused__native_batch_norm_legi | 14                  | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                       | 0.00508857313505646         | 14                     | 0                  | 0.1704672000243914    | 2.4044285714285714 | 0                          | 0.00508857313505646
 triton_poi_fused__native_batch_norm_legi | 58                  | 0                  | 2.307520779930795     | 8.190706896551722  | 0                       | 0.06888121731136704         | 58                     | 0                  | 2.307520779930795     | 8.190706896551722  | 0                          | 0.06888121731136704
 triton_per_fused__native_batch_norm_legi | 29                  | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                       | 0.001111738775280038        | 29                     | 0                  | 0.037243248971881276  | 3.0277586206896556 | 0                          | 0.001111738775280038
 triton_poi_fused__native_batch_norm_legi | 20                  | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                       | 0.0014154327747549007       | 20                     | 0                  | 0.04741699795428918   | 2.2911500000000005 | 0                          | 0.0014154327747549007
 triton_per_fused__native_batch_norm_legi | 25                  | 0                  | 0.13357016893727824   | 3.37536            | 0                       | 0.003987169222008305        | 25                     | 0                  | 0.13357016893727824   | 3.37536            | 0                          | 0.003987169222008305
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                       | 0.009223469457612694        | 13                     | 0                  | 0.3089862268300253    | 2.8111538461538457 | 0                          | 0.009223469457612694
 triton_poi_fused__native_batch_norm_legi | 17                  | 0                  | 0.3129385387909844    | 2.673              | 0                       | 0.009341448919133863        | 17                     | 0                  | 0.3129385387909844    | 2.673              | 0                          | 0.009341448919133863
 triton_per_fused__native_batch_norm_legi | 19                  | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                       | 0.0066136363060691275       | 19                     | 0                  | 0.2215568162533158    | 3.8837368421052636 | 0                          | 0.0066136363060691275
 std::enable_if<!(false), void>::type int | 23                  | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447   | 0.030203868944223014        | 23                     | 504916805.19297093 | 1.0118296096314707    | 8.113913043478261  | 0.0025893169497075447      | 0.030203868944223014
 triton_poi_fused_add_copy__38            | 56                  | 0                  | 0                     | 2.132482142857143  | 0                       | 0                           | 56                     | 0                  | 0                     | 2.132482142857143  | 0                          | 0
 triton_poi_fused_convolution_0           | 18                  | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                       | 0.012972719640279667        | 18                     | 0                  | 0.43458610794936897   | 2.773333333333334  | 0                          | 0.012972719640279667
 triton_poi_fused_convolution_1           | 17                  | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                       | 0.0008601884319153051       | 17                     | 0                  | 0.028816312469162712  | 2.6145882352941174 | 0                          | 0.0008601884319153051
 void convolve_common_engine_float_NHWC<f | 44                  | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169     | 0.0007382250748795709       | 44                     | 8641868995.31118   | 0.024730540008465626  | 25.87327272727273  | 0.04431727689903169        | 0.0007382250748795709
 triton_per_fused__native_batch_norm_legi | 12                  | 0                  | 0.6809930918986744    | 4.82675            | 0                       | 0.020328151996975356        | 12                     | 0                  | 0.6809930918986744    | 4.82675            | 0                          | 0.020328151996975356
 triton_per_fused__native_batch_norm_legi | 14                  | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                       | 0.0008606061486377935       | 14                     | 0                  | 0.02883030597936608   | 2.6651428571428575 | 0                          | 0.0008606061486377935
 triton_per_fused__native_batch_norm_legi | 16                  | 0                  | 0.0014658988233201874 | 2.098              | 0                       | 4.375817383045335e-05       | 16                     | 0                  | 0.0014658988233201874 | 2.098              | 0                          | 4.375817383045335e-05
 triton_poi_fused__native_batch_norm_legi | 13                  | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                       | 0.02963073785159611         | 13                     | 0                  | 0.9926297180284697    | 3.2367692307692306 | 0                          | 0.02963073785159611
 triton_poi_fused__native_batch_norm_legi | 9                   | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                       | 0.03883228983781048         | 9                      | 0                  | 1.3008817095666507    | 3.0863333333333336 | 0                          | 0.03883228983781048
 void at::native::(anonymous namespace):: | 98                  | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                       | 0.0027386076458833994       | 98                     | 0                  | 0.09174335613709389   | 4.408520408163265  | 0                          | 0.0027386076458833994
 void at::native::vectorized_elementwise_ | 7                   | 0                  | 0                     | 1.7278571428571428 | 0                       | 0                           | 7                      | 0                  | 0                     | 1.7278571428571428 | 0                          | 0
```

Pull Request resolved: pytorch#149697
Approved by: https://github.com/eellison, https://github.com/shunting314
…ch#157204)

currently failing with:

```
(/home/bobren/local/a/pytorch-env) [13:02] devgpu009:/home/bobren/local/a/pytorch python test/inductor/test_compile_subprocess.py -k GPUTests.test_async
/home/bobren/local/a/pytorch/torch/backends/cudnn/__init__.py:115: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
  warnings.warn(
/home/bobren/local/a/pytorch/torch/_inductor/ops_handler.py:741: UserWarning: undefined OpHandler.__getstate__, please add missing op schema
  warnings.warn(f"undefined OpHandler.{name}, please add missing op schema")
/home/bobren/local/a/pytorch/torch/_inductor/ops_handler.py:741: UserWarning: undefined OpHandler.__getstate__, please add missing op schema
  warnings.warn(f"undefined OpHandler.{name}, please add missing op schema")
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0] Unable to pickle input graph or example inputs
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0] Traceback (most recent call last):
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0]   File "/home/bobren/local/a/pytorch/torch/_inductor/compile_fx_ext.py", line 484, in serialize_compile
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0]     ).serialize()
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0]   File "/home/bobren/local/a/pytorch/torch/_inductor/compile_fx_ext.py", line 210, in serialize
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0]     return _WireProtocolPickledInput(GraphPickler.dumps(self))
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0]   File "/home/bobren/local/a/pytorch/torch/fx/_graph_pickler.py", line 124, in dumps
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0]     pickler.dump(obj)
W0628 13:02:30.666000 3610483 torch/_inductor/compile_fx_ext.py:491] [0/0] AttributeError: Can't pickle local object 'make_opaque_bitwise_fn.<locals>.BitwiseFn'
```

Pull Request resolved: pytorch#157204
Approved by: https://github.com/aorenste
…6378)

Summary: Noticed that the context's `quantized_engine` is accessed and written from multiple threads.

Test Plan:
➜  fbsource buck test --flagfile fbcode/mode/dev-tsan //xplat/assistant/integration_test/tests/supernova/speechtranslation:live_speech_translation_en_fr_tests -- --exact 'fbsource//xplat/assistant/integration_test/tests/supernova/speechtranslation:live_speech_translation_en_fr_tests - Translate/LiveSpeechTranslationTests.LiveSpeechTranslationEnFr/silence___fr_en'

Rollback Plan:

Differential Revision: D76921416

Pull Request resolved: pytorch#156378
Approved by: https://github.com/jerryzh168, https://github.com/cyyever
…h#157212)

Summary: Fixes a gap in the Triton update where the traversal would break because `get_tma_stores` didn't handle both TMA APIs.

Test Plan:
`buck test -m ovr_config//triton:beta  'fbcode//mode/dev-nosan' fbcode//ads_mkl/ops/tests:gdpa_dcpp_test -- --exact 'ads_mkl/ops/tests:gdpa_dcpp_test - test_gdpa_dcpp (ads_mkl.ops.tests.gdpa_dcpp_test.GdpaDCPPTest)'`

Rollback Plan:

Differential Revision: D77501582

Pull Request resolved: pytorch#157212
Approved by: https://github.com/davidberard98
Summary: Pretty simple. If a planner exists, which implies that planning is enabled, create a manager for each frame. The associated serial executor will use the `withMemoryPlannner` fn to ensure the deallocation is done after execution completes.

Test Plan: CI

Differential Revision: D73635809

Pull Request resolved: pytorch#157053
Approved by: https://github.com/henryoier, https://github.com/georgiaphillips
Summary:
D76832520 switched ExecuTorch to use the caffe2 c10 headers. This copy contains a shadowed variable, which is treated as an error for certain embedded compile flows.

A simple rename to avoid it.

Test Plan:
CI

Rollback Plan:

Differential Revision: D77446104

Pull Request resolved: pytorch#157107
Approved by: https://github.com/Skylion007
The old ASAN jobs have been replaced.

Pull Request resolved: pytorch#157197
Approved by: https://github.com/Skylion007
…callbacks (pytorch#157185)

Summary: Since we increment the counter after performing the callback, an assertion error is raised when the callback throws and the increment never happens. Increment first to avoid this.

Test Plan:
tba

Rollback Plan:

Differential Revision: D77475650

Pull Request resolved: pytorch#157185
Approved by: https://github.com/xmfan
Summary: To debug pytorch#156930. Not able to reproduce the problem locally.

Pull Request resolved: pytorch#157203
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <[email protected]>
This PR makes minimal changes to support sparse tensors on MPS. In follow-up PRs I'll start adding different operations slowly so we can fix pytorch#129842, which is highly requested (I assume because Whisper uses sparse tensors).
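A hedged illustration, assuming basic COO construction is among the initially supported operations:

```py
import torch

indices = torch.tensor([[0, 1], [1, 0]])
values = torch.tensor([3.0, 4.0])
sp = torch.sparse_coo_tensor(indices, values, (2, 2), device="mps")
```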

Pull Request resolved: pytorch#157238
Approved by: https://github.com/malfet
Fixes pytorch#155199

The issue on main is due to an outdated version of matplotlib. I have bumped the version so that it is compatible with NumPy 2.0.
Pull Request resolved: pytorch#155931
Approved by: https://github.com/malfet
atalman and others added 27 commits July 4, 2025 18:08
…ch#157630)

This contains two fixes that are required in main and will need to be cherry-picked to the Release 2.8 branch:
1. PR pytorch#155819 missed including the Triton change.
2. The CUDA STABLE variable needs to be set to 12.8. Updating CUDA stable updates the full static build.

Pull Request resolved: pytorch#157630
Approved by: https://github.com/Skylion007, https://github.com/jeanschmidt
…> `setuptools.build_meta` (pytorch#155998)

Change `build-system.build-backend`: `setuptools.build_meta:__legacy__` -> `setuptools.build_meta`. Also, move static package info from `setup.py` to `pyproject.toml`.

Now the repo can be installed from source via the `pip` command instead of `python setup.py develop`:

```bash
python -m pip install --verbose --editable .

python -m pip install --verbose --no-build-isolation --editable .
```

In addition, the SDist is also buildable:

```bash
python -m build --sdist
python -m pip install dist/torch-*.tar.gz  # build from source using SDist
```

Note that we should build the SDist with a fresh git clone if we will upload the output to PyPI, because all files under `third_party` will be included in the SDist; the SDist file will be huge if the git submodules are initialized.

Pull Request resolved: pytorch#155998
Approved by: https://github.com/ezyang, https://github.com/cyyever, https://github.com/atalman
ghstack dependencies: pytorch#157557
This unblocks pytorch/test-infra#6869.  The key changes to call out:

* B200 needs OIDC to access ECR and upload stats to S3, so we need to set `id-token: write` in `_linux-test`.  All workflows calling `_linux-test` also need to be updated accordingly
* Connecting sccache to S3 on B200 doesn't seem to work, so I disabled it. It still works locally, though.

### Testing

https://github.com/pytorch/pytorch/actions/runs/16055549292/job/45312298376
Pull Request resolved: pytorch#157341
Approved by: https://github.com/nWEIdia, https://github.com/atalman, https://github.com/malfet
`maybe_initialize_env_vars` and `initialize_nvshmem_with_store` are only used in `NVSHMEMSymmetricMemory.cu`. Moving them there.

Pull Request resolved: pytorch#157611
Approved by: https://github.com/Skylion007
ghstack dependencies: pytorch#157513
…157575)

This PR addresses a minor typo in the documentation file aten/src/ATen/cuda/tunable/README.md, where "paramters" has been corrected to "parameters" for improved clarity and consistency.

Context
Accurate and clear documentation is crucial for helping developers and contributors understand PyTorch internals. This small fix contributes to the overall quality and readability of the project.

Thank you to the PyTorch team and maintainers for your continued efforts in building such an incredible framework. I'm happy to contribute in any way I can — even if just with a small doc improvement like this one.

Pull Request resolved: pytorch#157575
Approved by: https://github.com/eqy
…157629)

This PR addresses a typo in the file `test/mobile/model_test/gen_test_model.py`.

### Changes:
- Corrected "occurances" to the correct spelling "occurrences"
- Renamed associated variables to reflect this change for consistency and clarity

This is a non-functional, cleanup-only PR to improve code readability.

Thanks to the PyTorch team for maintaining such a high-quality codebase

Pull Request resolved: pytorch#157629
Approved by: https://github.com/Skylion007
Fixes #ISSUE_NUMBER
This PR fixes a small punctuation issue in the PyTorch README.

Specifically:

Added a missing full stop at the end of the sentence:
"Note: You could refer to the cuDNN Support Matrix for cuDNN versions with the various supported CUDA, CUDA driver and NVIDIA hardware."

Added a comma for clarity between "CUDA driver" and "NVIDIA hardware".

These edits improve the readability and grammatical correctness of the documentation.
Pull Request resolved: pytorch#157623
Approved by: https://github.com/Skylion007
They might have been slow on CUDA 11.3, but that version of CUDA is long gone. The more fundamental underlying issue was the linear complexity of the recursive polynomial definitions for higher-order polynomials; for example, see this loop from the implementation of the Chebyshev polynomial of the first kind:
https://github.com/pytorch/pytorch/blob/7081b8233a64c350c64e9f00c9b9d00e52020241/aten/src/ATen/native/Math.h#L2969-L2973
which was tested by `test_compare_cpu` using the following values (at sample index 16):
https://github.com/pytorch/pytorch/blob/7081b8233a64c350c64e9f00c9b9d00e52020241/torch/testing/_internal/opinfo/core.py#L2079

Luckily, Chebyshev polynomials evaluated at absolute values greater than 1 quickly reach infinity; see below:
```
python3 -c "import torch;print(torch.special.chebyshev_polynomial_v(torch.nextafter(torch.tensor(1.0), torch.tensor(2.0)), torch.tensor(1e6)))"
tensor(nan)
```
That is not the case for Laguerre polynomials, but it's probably fine to just limit the input to 1e7.
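For intuition, a Python sketch of the recurrence the referenced C++ loop implements; evaluating T_n this way costs O(n) work, which is what made the high-order samples slow:

```py
def chebyshev_t(x: float, n: int) -> float:
    # T_0(x) = 1, T_1(x) = x, T_{k+1}(x) = 2x * T_k(x) - T_{k-1}(x)
    if n == 0:
        return 1.0
    if n == 1:
        return x
    p, q = 1.0, x
    for _ in range(n - 1):  # linear in n, hence slow for n ~ 1e6
        p, q = q, 2.0 * x * q - p
    return q
```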

Before
```
$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss..ssssss..ssssss..ssssssssssssssssssssss..ssssss/home/ubuntu/py3.10-nightly/lib/python3.10/site-packages/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssssssssssss..ssssss..ssssssssssssssssssssssssssssss..ssssss....ssssssssssss..ssssss..ssssss............ssssssssssssssssssssssssssssssssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssssssssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssss..ssssssssssssss
----------------------------------------------------------------------
Ran 432 tests in 8.575s

OK (skipped=344)
```
After
```
$ PYTORCH_TEST_WITH_SLOW=1 python test_ops.py -k chebyshev_polynomial_
ssssssss........................ssssssssssssssss......../home/ubuntu/pytorch/torch/backends/cuda/__init__.py:131: UserWarning: This API is going to be deprecated, please see https://pytorch.org/docs/main/notes/cuda.html#tensorfloat-32-tf32-on-ampere-and-later-devices (Triggered internally at /home/ubuntu/pytorch/aten/src/ATen/Context.cpp:78.)
  return torch._C._get_cublas_allow_tf32()
........................................................................................xxxxxxxx................ssssssssssssssssssssssss........................................................................................................ssssssss........................ssssssss........................................................................................ssssssss
----------------------------------------------------------------------
Ran 432 tests in 45.580s

OK (skipped=72, expected failures=8)
```

Fixes pytorch#79528

Pull Request resolved: pytorch#157464
Approved by: https://github.com/Skylion007, https://github.com/dcci
ghstack dependencies: pytorch#157488
…h#157619)

Follow-up from pytorch#157305, where @aorenste correctly suggested clearing the callback. This refactor introduces a new dataclass so we don't need to check nullability for each field.
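A hypothetical sketch of the idea (names invented for illustration, not the PR's actual fields):

```py
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PendingCallback:
    # fields that are only ever set together live in one container
    callback: Callable[[], None]
    context: str

# one None check for the container instead of one per field
pending: Optional[PendingCallback] = None

if pending is not None:
    pending.callback()
```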

Pull Request resolved: pytorch#157619
Approved by: https://github.com/aorenste
ghstack dependencies: pytorch#157305, pytorch#157614
…ytorch#157362)

This pull request fixes a minor typo in the doc comments of `test/nn/test_parametrization.py`.

- Replaced `'Intializing'` with `'Initializing'` in two docstring comments to improve clarity and maintain consistency across the codebase.

This is a non-functional change and does not impact behavior or test outcomes.

Thank you for maintaining such a high-quality codebase. Please let me know if any adjustments are needed. I'd be happy to help!

Pull Request resolved: pytorch#157362
Approved by: https://github.com/ezyang
…ytorch#157646)

This PR addresses a minor typo in the file `test/quantization/fx/test_model_report_fx.py`:

- Corrected the word "paramter" to "parameter" for better readability and accuracy.

While it's a small change, correcting such typographical errors contributes to maintaining the overall quality and professionalism of the codebase.

Thank you for your time and consideration in reviewing this PR. I'm happy to make any further adjustments if needed.

Pull Request resolved: pytorch#157646
Approved by: https://github.com/yewentao256, https://github.com/ezyang
…ytorch#157136)

Summary:
- Extract symbolic variables directly from graph placeholders and arguments
- Add symbolic variable definitions to generated repro code
- Add unit tests with ToyModel for testing

Pull Request resolved: pytorch#157136
Approved by: https://github.com/xmfan
ghstack dependencies: pytorch#157021
Fixes pytorch#157564

Fixes misspelling of the word parameter in documentation

Pull Request resolved: pytorch#157565
Approved by: https://github.com/awgu, https://github.com/cyyever
…ytorch#157666)

When the CC and CXX compilers are set to clang, and clang was compiled with libc++, compilation of torchvision fails with:

```
  File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 585, in build_extensions
    compiler_name, compiler_version = self._check_abi()
                                      ^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 1034, in _check_abi
    _, version = get_compiler_abi_compatibility_and_version(compiler)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/site-packages/torch/utils/cpp_extension.py", line 449, in get_compiler_abi_compatibility_and_version
    if tuple(map(int, version)) >= minimum_required_version:
       ^^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: '7+libcxx'
```

Compiler identification is a valid semantic version:
```
$ clang -dumpfullversion -dumpversion
20.1.7+libcxx
```

After adjusting the version parser, clang is able to compile extensions successfully.
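A minimal sketch of the kind of tolerant parsing that handles such suffixes (illustrative only, not the exact code in `cpp_extension.py`):

```py
import re

def parse_version(version: str) -> tuple[int, ...]:
    # keep only the leading digits of each dotted component
    return tuple(int(re.match(r"\d+", part).group()) for part in version.split("."))

assert parse_version("20.1.7+libcxx") == (20, 1, 7)
```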

Fixes pytorch#157665

Pull Request resolved: pytorch#157666
Approved by: https://github.com/msaroufim
…s before copying (pytorch#154369)

Fixes pytorch#151223

Because FSDP stores original parameters as views into a flattened tensor, changing the flattened parameter's tensor directly can desynchronize the views. With the NO_SHARD strategy this caused a shape-mismatch error when writing back modified parameters.

Ensured writeback handles NO_SHARD correctly by flattening tensors before copying. The logic now flattens the source parameter or gradient when the strategy is unsharded, to maintain the expected 1-D shape for writeback operations.
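A minimal sketch of the flatten-before-copy idea, with hypothetical names standing in for FSDP internals:

```py
import torch

flat_param = torch.zeros(12)         # stand-in for FSDP's 1-D flat parameter
new_weight = torch.randn(3, 4)       # user-modified (unsharded) parameter
src = new_weight.detach().flatten()  # flatten so the shapes match the flat storage
flat_param[: src.numel()].copy_(src)
```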

Pull Request resolved: pytorch#154369
Approved by: https://github.com/weifengpy
…pytorch#127294)

When `torch.backends.mkldnn.matmul.fp32_precision == 'bf16'`, we also enable mkldnn linear in the inductor path and allow it to run with the bf16 computation data type.
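A short usage sketch of the precision knob referenced above:

```py
import torch

# Allow fp32 linear/matmul on mkldnn to use bf16 computation internally.
torch.backends.mkldnn.matmul.fp32_precision = "bf16"
```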

Test Plan:
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_unary
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_fp32
python test/inductor/test_mkldnn_pattern_matcher.py -k test_multi_linear_share_same_input
```

Pull Request resolved: pytorch#127294
Approved by: https://github.com/jgong5, https://github.com/jansel

Co-authored-by: Jiang, Yanbing <[email protected]>
Chao1Han pushed a commit that referenced this pull request Jul 7, 2025
…torch#156600)

Don't call `sum()` on a tensor that is default constructed.

Previously we could call `sum()` on a tensor that was default-constructed. That would lead to an error like this:

```
Traceback (most recent call last):
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 58, in testPartExecutor
    yield
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 634, in run
    self._callTestMethod(testMethod)
  File "/home/ahmads/.conda/envs/pt3/lib/python3.12/unittest/case.py", line 589, in _callTestMethod
    if method() is not None:
       ^^^^^^^^
  File "/home/ahmads/personal/pytorch/torch/testing/_internal/common_utils.py", line 3191, in wrapper
    method(*args, **kwargs)
  File "/home/ahmads/personal/pytorch/test/test_nn.py", line 7235, in test_layer_norm_backwards_eps
    ln_out_cuda.backward(grad_output_cuda)
  File "/home/ahmads/personal/pytorch/torch/_tensor.py", line 647, in backward
    torch.autograd.backward(
  File "/home/ahmads/personal/pytorch/torch/autograd/__init__.py", line 354, in backward
    _engine_run_backward(
  File "/home/ahmads/personal/pytorch/torch/autograd/graph.py", line 829, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: tensor does not have a device
Exception raised from device_default at /home/ahmads/personal/pytorch/c10/core/TensorImpl.h:1265 (most recent call first):
C++ CapturedTraceback:
#4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0
#5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0
#6 c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) from ??:0
#7 at::TensorBase::options() const from :0
#8 at::meta::resize_reduction(at::impl::MetaBase&, at::Tensor const&, c10::OptionalArrayRef<long>, bool, c10::ScalarType, bool) from :0
#9 at::meta::structured_sum_dim_IntList::meta(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#10 at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#11 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>), &at::(anonymous namespace)::wrapper_CompositeExplicitAutogradNonFunctional_sum_dim_IntList>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType> > >, at::Tensor (at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from RegisterCompositeExplicitAutogradNonFunctional_0.cpp:0
#12 at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) from ??:0
#13 void at::native::(anonymous namespace)::LaunchGammaBetaBackwardCUDAKernel<float, float>(float const*, float const*, float const*, float const*, long, long, at::Tensor*, at::Tensor*, CUstream_st*) from ??:0
#14 void at::native::(anonymous namespace)::LayerNormBackwardKernelImplInternal<float>(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#15 at::native::(anonymous namespace)::LayerNormBackwardKernelImpl(at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, at::Tensor*, at::Tensor*, at::Tensor*) from ??:0
#16 at::native::layer_norm_backward_cuda(at::Tensor const&, at::Tensor const&, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from ??:0
#17 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA__native_layer_norm_backward(at::Tensor const&, at::Tensor const&, c10::ArrayRef<c10::SymInt>, at::Tensor const&, at::Tensor const&, std::optional<at::Tensor> const&, std::optional<at::Tensor> const&, std::array<bool, 3ul>) from RegisterCUDA_0.cpp:0

```

Now we only call `sum(0)` on tensors that are defined and properly guard the `sum(0)` and assignment.
Pull Request resolved: pytorch#156600
Approved by: https://github.com/eqy, https://github.com/ngimel
@Chao1Han Chao1Han closed this Jul 7, 2025